Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

56 ◾ Bioinformatics

transform (BWT), and Full-text Minute-space (FM-index). Aligners implement a variety

of searching algorithms to determine where short reads originated in the indexed refer-

ence genome sequence. Aligning of a read to a genomic location depends on the sequence

similarity. However, a good aligner should expect mismatches due to the sequencing errors

and genetic variation between the reference genome and a sequenced individual. In the

following, we will discuss these commonly used data structures for indexing and popular

searching methods.

2.2.1 Trie

A trie or a prefix tree is a data structure that is used for fast retrieval on large datas-

ets such as looking up sequencing reads. The name was derived by E. Fredkin in 1960

from the phrase “Information Retrieval” by Fredkin (1960) [5]; however, the idea was

first described by the Norwegian mathematician Axel Thue in 1912. A trie is an ordered

tree that can be used to represent a set of strings (reads) over a finite alphabet set, which

is A, C, G, and T for the DNA sequences. It allows reads with common prefixes to use

that prefix and stores only the ends of the reads to indicate different sequences. A trie is

represented by nodes and edges. The root node is empty and then each node in the trie

represents a single character. The maximum number of possible children of a node is four

in case of DNA. The children of a node are ordered alphabetically. The leaf nodes are the

nodes that represent the last characters of reads. A read is represented by the path from

the root node to a leaf node.

The trie data structure, as it is, is not suitable for indexing a genome sequence since

it stores a set of string. However, the generalized idea of the trie is used in the suffix tree

which is widely used to index a reference genome sequence.

2.2.2 Suffix Tree

The suffix tree, as a generalization of the trie data structure, is basically used for pattern

matching and finding substrings in a given string of characters. It is constructed as key-

value pairs (as the python dictionary) where all possible suffixes of the reference sequence

are the keys and positions (indexes) in the sequence as their values. In 1995, Esko Ukkonen

proposed a linear-time algorithm called Ukkonen’s algorithm [6], which constructs a suf-

fix tree in a linear time complexity that the time taken for the indexing increases linearly

with the increase in the number of nodes O(n).

For the sake of simplicity, we will try to show you how to build a suffix tree for a refer-

ence sequence and how mapping of the reads is carried out. Assume that our reference

sequence consists of 10 bases as “CTTGGCTGGA$”, where the positions of the bases

are 0, 1, 2, …, 9 and $ is an empty trailing position. The first step of constructing a suf-

fix tree is by forming suffixes (keys) and indexes (values) pairs from the sequence. We

begin from the last character in the sequence, which is the empty character “$” in the

position 10 in the sequence. Then, we form the suffix “A$”, whose first character is in

position 9 in the sequence. We continue this way until all possible suffixes are created

as shown below: